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Abstract 

We consider the matrix completion problem of recovering a structured matrix from noisy and par¬ 
tial measurements. Recent works have proposed tractable estimators with strong statistical guaran¬ 
tees for the case where the underlying matrix is low-rank, and the measurements consist of a subset, 
either of the exact individual entries, or of the entries perturbed by additive Gaussian noise, which 
is thus implicitly suited for thin-tailed continuous data. Arguably, common applications of matrix 
completion require estimators for (a) heterogeneous data-types, such as skewed-continuous, count, 
binary, etc., (b) for heterogeneous noise models (beyond Gaussian), which capture varied uncer¬ 
tainty in the measurements, and (c) heterogeneous structural constraints beyond low-rank, such 
as block-sparsity, or a superposition structure of low-rank plus elementwise sparseness, among 
others. In this paper, we provide a vastly unified framework for generalized matrix completion by 
considering a matrix completion setting wherein the matrix entries are sampled from any member 
of the rich family of exponential family distributions; and impose general structural constraints on 
the underlying matrix, as captured by a decomposable norm regularizer 72(.). We propose a simple 
convex regularized M-estimator for this generalized framework, and provide a unified and novel 
statistical analysis for this class of estimators. We finally corroborate our theoretical results on 
simulated datasets. 

Keywords: Matrix Completion, Exponential Families, High Dimensional Prediction, Low Rank 
Approximation, Nuclear Norm Minimization 


1. Introduction 


In the general problem of matrix completion, we seek to recover a structured matrix from noisy and 
partial measurements. This problem class encompasses a wide range of practically important appli¬ 
cations such as recommendation systems, recovering gene-protein interactions, and analyzing doc¬ 
ument collections in language processing, among others. In recent years, leveraging developments 
in sparse estimation and compressed sensing, there has been a surge of work on computationally 
tractable estimators with strong statistical guarantees, specifically for the setting where a subset of 
entries of a low -rank matrix are observ ed either deterministically, or pert urbed by addit i ve nois e 
that is Gaussia n Candes and Plan (2010), or more generally sub-Gaussian iKeshavan et al.1 (12010m : 
Ne gahba n and W ainwr ightj (2012). While such a Gaussian noise model is amenable to the subtle 
statistical analyses required for the ill-posed problem of matrix completion, it is not always practi¬ 
cally suitable for all data settings encountered in matrix completion problems. For instance, such a 
Gaussian error model might not be appropriate in recommender systems based on movie ratings that 
are either binary (likes or dislikes), or range over the integers one through five. The first question 
we ask in this paper is whether we can generalize the statistical estimators for matrix completion as 
well as their analyses to general noise models? Note that a noise model captures the uncertainty un¬ 
derlying the matrix measurements, and is thus an important component of the problem specification 
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given any application; and it is thus vital for broad applicability of the class of matrix completion 
estimators to extend to general noise models. 

Though this might seem like a narrow technical, although important question, it is related to a 
broader issue. A Gaussian observation model implicitly assumes the matrix values are continuous¬ 
valued (and that they are thin-tail-distributed). But in modern applications, matrix data span the 
gamut of heterogeneous data-types, for instance, skewed-continuous, categorical-discrete includ¬ 
ing binary, count-valued, among others. This, thus gives rise to the second question of whether we 
can generalize the standard matrix completion estimators and statistical analyses, suited for thin¬ 
tailed continuous data, to more heterogeneous data-types? Note that there has been some recent 
work for the specific case of binary data by Davenport et ah (120120 . but generalizations to other 
data-types and distributions is largely unexplored. 

The recent line of work on matrix completion, moreover, enforces the constraint that the un¬ 
derlying matrix be either exactly or approximately low-rank. Aside from the low-rank constraints, 
further assumptions to eliminate overly “s piky” matrices are required for well-posed recovery under 
partial measurements Candes and Recht (2009). Early work provided generalization error bounds 
for various lo w-rank matrix completio n algorithms, including a l gorithms based on nucle ar norm 


minimization 


Candes and Recht ( 2009 ); Candes and 


(201 lj), max-margin matrix factorization 


Srebro et al 


'ao (2010); Candes and Plan (2010); Recht 


(2004), spectral algorithms 


Keshavan et al. 


( 2010a Ibh. and alternating minimization Jain et al. ( 2013 1. These work made stringent matrix inco¬ 
herence assumptions to avoid “spiky” matrices^ These assumptions have been made less stringent 
in more recent results Negahban and Wainwright (2012J), which moreover extend the guarantees to 
approximately low-rank matrices. Such (approximate) low-rank structure is one instance of general 
structural constraints which are now understood to be necessary for consistent statistical estimation 
under high-dimensional settings (with very large number of parameters and very few observations). 
Note that the high-dimensional matrix completion problem is particularly ill-posed, since the mea¬ 
surements are typically both very local (e.g. individual matrix entries), and partial (e.g. covering a 
decaying fraction of entries of the entire matrix). However, the specific (approximately) low-rank 
structural constraint imposed in the past work on matrix completion does not capture the rich variety 
of other qualitatively different structural constraints such as row-sparseness, column-sparseness, or 
a superposition structure of low-rank plus elementwise sparseness, among others. For instance, in 
the classical introductory survey on matrix completion Laurent (2009), the authors discuss structural 
constraints of a contraction matrix , and a Euclidean distance matrix. Thus, the third question we ask 
in this paper is whether we can generalize the recent line of work on low-rank matrix completion 
to the more general structurally constrained case. 

In this paper, we answer all of the three questions above in the affirmative, and provide a vastly 
unified framework for generalized matrix completion. We address the first two questions by con¬ 
sidering a general matrix completion setting wherein observed matrix entries are sampled from any 
member of a rich family of natural exponential family distributions. Note that this family of distri¬ 
butions encompass a wide variety of popular distributions including Gaussian, Poisson, binomial, 
negative-binomial, Bernoulli, etc. Moreover, the choice of the exponential family distribution can 
be made depending on the form of the data. For instance, thin-tailed continuous data is typically 
modeled using the Gaussian distribution; count-data is modeled through an appropriate distribution 
over integers (Poisson, binomial, etc.), binary data through Bernoulli, categorical-discrete through 
multinomial, etc. We address the last question by considering general structural constraints upon 
the underlying matrix, as captured by a general regularization function 7Z(.). Our general matrix 
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completion setting thus captures heterogeneous noise-channels, for heterogeneous data-types, and 
heterogeneous structural constraints. 

In a key contribution, we propose a simple regularized convex M-estimator for recovering the 
structurally constrained underlying matrix in this general setting; and moreover provide a unified 
and novel statistical analysis for our general matrix completion problem. Following a standard 
approach Negahban (2012), we (a) first showed that the negative log-likelihood of the subset of 
observed entries satisfies a form of Restricted Strong Convexity (RSC) (Definition [4]); and (b) un¬ 
der this RSC condition, our proposed M-estimator satisfies strong statistical guarantees. We note 
that proving these individual components for our general matrix completion problem under general 
structural constraints required a fairly delicate and novel analysis, particularly the first component 
of showing the RSC condition, which we believe would be of independent interest. A key corol¬ 
lary of our general framework is matrix completion under sub-Gaussian samples and low-rank 
constr aints, where we show that our theorem recove r s results comparable to the existi ng litera¬ 


ture 


Candes and Planl (l2010h : iKeshavan et al.1 (1201 Obi) : iNegahban and Wainw right (2012). Finally, 


we corroborate our theoretical findings via simulated experiments. 


1.1 Notations and Preliminaries 


In this subsection we describe the notations and definitions frequently used throughout the paper. 
Matrices are denoted by capital letters, X , 0, M, etc. For a matrix M, Mj and MW are the j th 
column and i th row of M respectively, and M hj denotes the ( i,j) th entry of M. The transpose, 
trace, and rank of a matrix M are denoted by M', tr(M), and rk(M), respectively. The inner 
product between two matrices is given by (. X , Y) = tr(Xi’Y') = XijYjj. 

For a matrix M £ K m x " of rank r, with singular values <xi > <Ji > ... a r , commonly used ma¬ 
trix norms include the nuclear norm ||M||* = JT a t , the spectral norm ||M ||2 = 0 i, the Frobenius 

norm ||M||i? = \J~Yhi of, and the maximum norm ||M|| max = max^ j) M^-. 

Given any matrix norm || • ||, the dual norm, || ■ ||* is given by ||^C||* = sup|iy|i <1 (A', Y). 

Definition 1 (Natural Exponential Family). A distribution of a random variable X, in a normed 
vector space is said to belong to the natural exponential family, if its probability density function, 
characterized by the parameter 0 in the dual vector space, is given by: 

P(X\@) = h(X) exp ((X, 0) - G(0)) , 

where G(Q) = log J x h(X)e' X '^dX, called the log-partition function, is strictly convex, and 
analytic. 


Definition 2 (Bregman Divergence). Let <j> : domfoi) —>• M be a strictly convex function differ¬ 
entiable in the relative interior of dom(0). The Bregman divergence (associated with <f) between 
x € dom(0) and y € ri(dom((/;)) is defined as: 

B<p(x, y) = 4>{x) - 4>{y) - (\7f(y),x - y). 


Definition 3 (Subspace compatibility constants). Given a matrix norm we define the following 
maximum and minimum subspace compatibility constants of 1Z(.) w.r.t the subspace A4: 


V(M;7l)= sup 

0GM\{O} 


= mf 


7Z(Q) 


Thus, V© € M 


’Fr 


e^fo} ||0 ||f ' 
WI|0||f <n{Q)<^{M,n)\\Q\\ F . 
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In the rest of the paper, to avoid notational clutter, the dependence of T and T m i n on R, is suppressed. 
Thus, and T nim (X) are denoted as T'(Ad) and T lllin , respectively. 

Definition 4 (Restricted Strong Convexity). A loss function C is said to satisfy Restricted Strong 
Convexity with respected to a subspace S, if for some pc > 0, 

£(0 + A) - £(©) - (V£(0), A) > [icW^Wl, VA € S. 


Definition 5 (Sub-Gaussian Distributions). A mean zero random variable X is said to have a sub- 


sXl 


< e ! 


2 ft 2 / 2 


Gaussian distribution with parameter b if V s > 0, the distribution satisfies E[e 
Further, if X is sub-Gaussian with parameter b and E[X] = 0, then Var(X) < if (Vershynin 

fcoioh i. 


2. Exponential Family Matrix Completion 

Denote the underlying target matrix by 0* e R mxn . We then assume that individual entries 0*- arc 
observed indirectly via a noisy channel: specifically, via a sample drawn from the corresponding 
member of natural exponential family (see Definition [Q: 

P(Xij\etj) = h(Xij) exp (XijQC - G(0T)) , (D 

where G : R —>• R is a strictly convex, and analytic function called the log-partition function. 

Consider the random matrix X € M mxn , where each entry X l} is drawn independently from 
the corresponding distribution in dT]); it can be seen that: 

P(X|0*) = H (h(X tJ ) exp (X tj % - G(0*j))) 
ij 

= h(X) exp ((X, 0*) — G(&*)), (2) 

where we overload the notation to denote G : M mxn —>• R as G'(0) = G(@ij), and the base 

measure h(X) as h(X) = ■ h(xij). 

Uniformly Sampled Observations: In a “fully observed” setting, we would observe all the 
entries of the observation matrix X € M mxn . However, we consider a partially observed setting, 
where we observe entries over a subset of indices 0 c [m] x [n]. We assume a uniform sampling 
model, so that 

V (i,j) € 0, i ~ uniform([m]), j ~ uniform([n]). (3) 

Note that, under the above described sampling scheme, an index (i,j) can be sampled multiple 
times, in such cases we include the multiple instances of (i, j) in 0 (and not just the unique indices 
in 0). Given 0, we define the following matrix Vq(X) as 

V n (X)= J2 
(j,i)eo 

The matrix completion task can then be stated as the estimation of 0* from (0, Vn(X)), where 
X ~ P(X|0*). As noted earlier, this problem is ill-posed in general. However, as we will show, 
under structural constraints imposed on the parameter matrix 0*, we are able to design an M- 
estimator with a near optimal deviation from 0*. 

2.1 Applications 

Gaussian (fixed cr 2 ) is typically used to model continuous data, x € R, such as measurements with 
additive errors, affinity datasets. Here, G{9) = \o 2 6 2 . 
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Bernoulli is a popular distribution of choice to model binary data, x € {0,1}, with G(9) = 
log (1 + e 6 ). Some examples of data suitable for Bernoulli model include social networks, gene 
protein interactions, etc. 

Binomial (fixed N) is used to model number of successes in N trials. Here, xe {0,1,2,..., N}, 
and G(6) = N log (1 + e e ). Some applications include predicting success/failure rate, survey out¬ 
comes, etc. 

Poisson is used to model count data x € {0,1, 2,...}, such as arrival times, events per unit time, 
click-throughs among others. Here, G(6) = e e . 

Exponential is often used to model positive valued continuous data x € R+, specially inter arrival 
times between events. Here, G{9) = — log (— 9 ). 


2.2 Log-likelihood 


Denote the gradient map: 


5(0) = VG(0) € 


where g(Q)ij = 


8G(&) 


89 . 


v 


It can then be verified that the mean and variance of the distribution P(X |0*) are given by E[X] = 
p(0*), and Var(X) = V 2 G(©*), respectively. The Fenchel conjugate of the log partition function 
G, is denoted by: F(X) = sup e ( X , 0) — G(0). 

A useful consequence of the exponential family is that the negative log-likelihood is convex 
in the natural parameters 0*, and moreover has a bijection with a large class of B res man diver¬ 
gences (Definition [2]>. The following relation ship was first noted by Forster and Warmuthl (|2002), 
and later established bv lBanerjee et al.1 ( 2005 1 [Theorem 4]: 

- logP(X|0) oc B F (X,g(Q )), MX e dom(F). (4) 


2.3 Discussion and directions for future work 


We consider the standard matrix-completion setting where the distribution of the observation matrix 
Xin© corresponds to its entries X l} being drawn independently from the other entries. Further, 
the probability of observing a specific entry X t j, under uniform sampling is independent of the 
noise channel or the distribution P(Xjj|0T). However, in some applications, it might be beneficial 
to have a sampling scheme involving dependencies; for instance, when a user gives a movie a bad 
rating, we might want to vary the sampling scheme to sample an entirely different region of the 
matrix. It would be interesting to extend the analysis of this paper to such a dependent sampling 
setting. 

The form of the observation random matrix distribution in ©, given the individual observations 
in CD), can be seen to have connotations of multi-task learning: here recovering each individual 
matrix entry 0*- together with its corresponding noise model forms a single task, and all these tasks 
can be performed jointly given the shared structural constraint on 0*. It would be interesting to 
generalize this to form a more general statistical framework for partial multi-task learning. 

We use the general class of exponential family distributions as the underlying probabilis¬ 
tic model capturing the measurement uncertainties. The richness of the class of exponential 
family distributions has been used in other settings to provide general statistical frameworks. 
Kakade et al . ( 20101) provide a generalization of compressed sensing problem to general expo¬ 
nential family distributions. Note however that results from compressed sensing cannot be im¬ 
mediately extended to matrix completion case, since the sampling operator Vn does not satisfy 
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the typical assumptions (restricted isometry or restricted eigenvalues) made in such settings; see 


abilistic PCA 


4' 

) 9 |) 


dCandes a nd Recht, 2009) for additional discussion. There have been extensions of classical prob- 


Tipping and Bishop (1999 ) from Gaus s ian noise mod els to exponential family distri 


butions ICollins et a.1.1 (120011) : iMohamed et al.1 (120081) : iGordonl (120021) . There have also been recent 
extensions of probabilistic graphical model classes, beyond Gaussian and Ising models, to multi¬ 


variate extensions of exponential family distributions (Yang et ah, 2012, 120131) 


More co mplicated probabilistic models h ave also been proposed i n the c ontext of collabora¬ 
tive filtering Mnih and Saiakhutdinov ( 2007 ): Saiakhutdinov and Mnih ( 2008 ). but these typically 
involve non-convex optimization, and it is difficult to extend the rigorous statistical analyses of the 
form in this paper (and in the matrix completion literature) to these models. 


3. Main Result and Consequences 

As noted in the introduction, we consider the matrix completion setting with general structural con¬ 
straints on the underlying target matrix 0*. To formalize the notion of such structural constraints, 
we follow (Negahban, 2012 ). and assume that 0* satisfies 0* € M. C Ad C R mxn , for some 
subspace Ad C Ad, which contains parameter matrices that are structured similar to the target 
(the corresponding structural constraints such as low rankness, low rankness+sparsity etc); we also 
allow the flexibility of working with a superset Ad of the model subspace that is potentially eas¬ 
ier to analyze. Moreover, we use their definition of a decomposable norm regularization function, 
72(.) : M mxn —» R. |_, which suitably captures these structural constraints: 

Assumption 1. ( Decomposable Norm Regulcirizer) We assume that 72 (.) is a matrix norm, and is 
decomposable over (Ad, Ad ), i.e. ifX € Ad, Y € Ad , then, 

n{x + y) = n{x) +n(Y). 

We provide some examples of such decomposable regularizes and structural constraint sub¬ 


spaces, and refer to (Negahban, 2012) for more examples and discussion. 


Example 1. Low-rank: This is a common structure assumed in numerous matrix estimation 
problems, particularly those in collaborative filtering, PCA, spectral clustering, etc. The correspond¬ 
ing structural constraint subspaces (Ad, Ad" 1 ) in this case correspond to a linear span of specific 
one-rank matrices; we discuss these in further detail in Section 13.21 where we derive a corollary 
of our general theorem to the specific case of recovery guarantees for low-rank constrained matrix 
completion. The nuclear norm 72(0) = [|0L = T ~\ u ay,, has been shown to be decomposable with 


respect to these constraint subspaces Fazel et ah (20011). 


Example 2. Block sparsity: Another important structural constraint for a matrix is block- 
sparsity, where each row is either all zeros or mostly non-zero, and the number of non-zero rows 
is small. The structural constraint subspaces in this case correspond to a linear span of specific 



the i th row of 0, the i\fl q norm is defined as: 


(Oil _ 


m n 

£[(£ 


0 ..19 
VJl] I 


!/<?■ 


l|eili,, = Ell 0(i "? 

2—1 2=1 j= 1 

Example 3. Low-rank plus sparse: This structure is often used to model low-rank matrices 
which are corrupted by a sparse outlier noise matrix. The structural constraint subspaces corre- 
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sponding to these consist of the linear span of weighted sum of specific rank-one matrices and 
sparse matrices with non-zero components on specified positions; and appropriate regularization 
function decomposable with respect to such structural constraints is the infimum convolution of the 
weigh t ed nuclear norm with weigh ted elementwise l\ norm, ||M||ip = JT. | M tJ ICandes et al. 
( 2011 1: Yang and Ravikurtiarl ( 2013 1: 

K(Q) = inf{A 1 ||S'||i,i + A 2 ||L||* : 0 = S + L}. 


The second assumption we make is on the curvature of the log-partition function. This is 
required to establish a form of RSC (Definition [4} for the loss function. 

Assumption 2. The second derivative of the log-partition function G : M —>• R has atmost an 
exponential decay, i.e, 

\7 2 G(u) > VmGI, for some p > 0 

It can be verified that commonly used members of natural exponential family obey this assump¬ 
tion. 


Finally, we make an assumption to avoid “spiky” target matrices. As ICandes and Recht (2009) 
show with numerous examples, low-rank and presumably other such structural constraints as above, 
by themselves are not sufficient for accurate recovery, in part due to the infeasibility of recov¬ 
ering overly “s piky” matrices with very few large entries. Early work Candes and Plan (2010); 
Keshavan et al.1 (1201 Oalibl) . a ssumed stringent matrix incoherence conditions to preclude such matri¬ 
ces, while more recent work lDavenport et al.1 (120121) : iNegahban and Wainwrightl (120121) . relax these 
assumptions to restricting the spikiness ratio, defined as follows: 

~tfVti 1 1 011 max 


a 


sp 


( 0 ) = 


lieilj 


(5) 


Assumption 3. There exists a known a* > 0, such that 

||01 max = ^D||0*| 


Imn 


F < 


a 


'mn 


3.1 M-estimator for Generalized Matrix Completion 

We propose a regularized M-estimate as our candidate parameter matrix 0. The norm regularizer 
72(.) used is a convex surrogate for the structural constraints, and is assumed to satisfy A[[| For a 
suitable A > 0, 

0= argmin ^ ^ - log P(X ij |0 u ) + XTZ(Q) 


argn “k mi 






+ A K(9). 


( 6 ) 


||©||max< 

= argmin ^ G(©y) - 

lieiUax^ ije n 

In the above estimator, for simplicity we have assumed that the domain of the minimizing function 
spans all or M mxn . In cases where this is violated, additional constraints to restrict 0 to the domain 
could be imposed on the estimator and the results and analysis in the following section still hold. 
The above optimization problem is a convex program, and can be solved by any off-the shelf convex 
solvers. 
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3.2 Main Results 


Without loss of generality, suppose that m < n. Let 1Z*{.) = sup K ^j <1 (X, .) be the dual norm of 
the regularizer 7Z[.). Further, let T (A4) and 'I' min be the maximum and minimum sub space com¬ 
patibility constants of 1Z w.r.t the model subspace M. (Definition l3lFl We next define the following 
quantity: 


HTz(n, |D|) := E 



^ e ij e i e j) 


where the expectation is over the random sampling index set f l, and over the Rademacher sequence 
{eij : ) £ fi}; here {e^ £ M m }, {ej £ M n } are the standard basis. This quantity kk(ti, |D|) 

captures the interaction between the sampling scheme and the structural constraint as captured by 
the regularizer (specifically its dual 71*). Note that it depends only on n (n > m), and on the size 
|fl| of fi. 


Theorem 1. Let 0 be the estimate from © with j > j^7Z* (Vq(X — g((~)*)). Under the As¬ 
sumptions 0-0 if\Q\ > Co'& 2 (Xi)nlogn) for large enough Co, then given a constant /3 > 0, 3 a 
constant Kg > 0 such that, using pc '■= e (Kp — > the following holds with 

probability > 1 — 4e —log2 n : 


lie 


0*|||* < 4' 2 (Af)max 


3A 2 CQa* 2 nlogn 

Wc Pi 


provided pc > 0 . 


if, 


are constants from Assumptions [2] and [3j re- 
(2012), as well as Negahban and Wainwright 


In the above theorem, r/ and a* > a.9p ('0*)l|0* 
s pectiv ely. Our proof uses elements from Negahhar 
(2012) where they analyze the case of low-rank structure and additive noise, and establish a form 
of restricted strong convexity (RSC) for squared loss over subset of matrix entries (closely relates 
to the special case, when the exponential family distribution assumed in © is Gaussian). However, 
showing such an RSC condition for our general loss function over a subset of structured matrix 
entries involved some delicate arguments. Further, we provide a much simpler proof that moreover 
only required a low-spikiness condition rather than a multiplicative spikiness and structural con¬ 
straint. Moreover, we are able to provide an RSC condition broadly for general structures, and the 
negative log-likelihood loss associated with general exponential family distribution. 


3.3 Corollary 

An important special case of the problem is when the parameter matrix 0*, is assumed to be of a 
low-rank r <C m. The most commonly used convex regularizer to enforce low-rank is the nuclear 
norm. The intuition behind the low-rank assumption on the parameter matrix is as follows: the 
parameter ©*-, can be thought of as the true measure of affinity between the entities corresponding 
to row i and column j, respectively; and the data X l3 , is a sample from a noisy channel parametrized 
by @ij. It is hypothesized that {0£ }, are obtained from a weighted interaction of a small number 
of latent variables as, ©*■ = Ylk=i 17 k u ik v ]k- 

*We suppress the dependence of 4' and 'bmm on 1Z in our notation to avoid notational clutter 













Let {uk £ M m } and {vk € IR™}, k € [r] be the left and right singular vectors, respectively 
of 0*. Let the column and row span of 0* be U* = col(0*) = spanjui} and V* = row(0*) = 
span{vj}, respectively. Define: 

M := {0 : row(0) C V* , col(0) C U*}, and 

(J]') 

M 1 - := {0 : row(0) C F* x , col(0) C U* L }. 

It can be verified that, Ad / Ad , however, Ad C Ad. 


Corollary 1. Let 0* be a low-rank matrix of rank atmost r <C rn. Further, let V(i. j), (X i? — 
g(@*j)) are sub—Gaussian (Definition [5j) with parameter b, and Q > corn log n for large enough 
constant cq. Given any /3 > 0, there exists constants cp > 0, Cp > 0 and Kp > 0, such that using 

TZ(.) = ||.||* and ^ := cp^/mnb^Jin ©, w.p. > 1 — log2 71 — e~^ 1+ ^ lo §( n ), 


10 -©■' 


mn 

2ria* 


2 „ rna x{b 2 ,a* 2 /mn} 

\F < - 2 - 

n 


rn log n 


where pc = Kp& Vmn > 0. 

Remark 1: Note that the above results hold for the minimizer 0 of the convex program in ©, 
optimized for any a* > a< jp (0*)||0*||^; in particular it holds with a* = a sp (0*)||0 *||f, where 
1 < o:. S7 ,(0*) < yjmn. While in practice a* is chosen through cross-validation, the theoretical 
bound in Corollary [j] can be tightened to the following (if ||0 ||f > 1): 


||© — 0* 


<Cp- 


a\ 


sp 


( 0 * 


I max {6 2 ,1} f rn log n 


Pc 


( 8 ) 


I|0*II 2 f 

Similar bound can be obtained for Theorem [T] 

Remark 2: b 2 is a measure of noise per entry; V(i. j), Var(X i; — g(Q*j )) < b 2 . Note that as we 
do not make stronger matrix incoherence assumptions, only an approximate recovery is guaranteed 
even as b —>■ 0. 


4. Proof 

In this section we provide key steps in the proofs of the main results (Section I3.2H3.3I) . Proofs of 
intermediate lemmata are deferred to the appendix. 


4.1 Proof of Theorem [T| 

The proof of our main theorem involves two key steps: 

• We first show that, under assumptions Assumption [l]-[3l RSC of the form in Definition @]holds 
for the loss function in © over a large subset of the solution space. 

• When the RSC condition holds, the result follows from a few simple calculations; we handle 
the case where RSC does not hold separately. 

Let A = 0 — 0*. We state two results of interest. 


Lemma 1. We define the following subset: 

V = {© € R mx ” : n(Q-x) < 

where recall (M ., from Assumption [7] and Ojp is the projection of © onto the subspace Ad. 
IfQ is the minimizer of ©, and ^ >jrf'R,*('P(>(X — g(&*)), then A = 0 — 0* € V. 

The proof follows from Lemma 1 of Negahban ( 2012t) . □ 


9 


















Lemma 2. Let 0 be the minimizer of ©. If ^ > ^TZ*(Vq(X — g(Q*)), then: 


mn 

w 


E B G(®iL%) 

(ij)en 


The proof is provided in Appendix IA.21 


< 


3\V(M) 

2 



□ 


Recall the notation a sp (A) = . We now consider two cases, depending on whether 

the following condition holds for the constant Co > 0 dictated by Theorem Q] 

a sp (A)<—(9) 
coT'(Ad) y nlogn 

Case 1: Suppose condition in ([9]) does not hold; so that a sp (A) > e y Eoi n • hi' om the 

constraints of the optimization problem ([6]), we have that ||A|| max < 1 1 © 1 1 max + II©* Umax A 
(2 a* /y/mn). Thus, 


IIAIIf 


/mn 


Osp(A) 


< 2c 0 a* 


' ^t 2 (M.)nlogn 


|0| 


( 10 ) 


Case 2: Suppose condition in Q does hold. Then, the following theorem shows that an RSC 
condition of the form in Definition 0] holds. 


Theorem 2 (Restricted Strong Convexity). For cq given by Theorem [7] let a sp ( A) < 
- \J n\ogn - Bor l ar 8 e enough cq, given any constant (3 > 0, there exists constant Kp > 0 

such that, under the assumptions in Theorem\T\ w.p. > 1 — 4e _ , l+; t)'l , I 1 lllll log 2 n. 


mn 

w 




where pc = e 


2rjot 

Wmn 


Kp- 


|f2|n: 


2 (n,|n|) 


nlogn 


As noted earlier, such an RSC result for the s pecia l case of squared loss under low-rank con¬ 
straints was shown iniNe gah b an and Wai nw right (2012). We prove this theorem in Section l4~3l 


Remaining steps of the proof of Theorem Q} Thus, if a sp (A) < and > °’ 

from Theorem |2] and Lemma [3 w.h.p.: 


Pc ||A|| 2 f < 


mn 

w 


E s g(0, 

ij£Q 




3Atf(M). 


( 11 ) 


From m and (ITTI ). under assumptions of Theorem [I] w.p. > 1 


4e lo §“ n ? we have: 


A|||i < T ,2 (Ad)max 


3A 2 a* 2 c^n log n \ 

W n r 
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4.2 Proof of Corollary |Tj 

From the definition of jY in ©, we have M. = spanjuiX^, yv£ : x E M n , y E M m }. Let 
Pit* E W nxm and Py* E M nxn , be the projection matrices onto the column and row spaces (U *, 
V*) of 0*, respectively. We have, YX E M mxn , X M = P V *X + XP V * - P V *XP V *. Also, 
rk (Pu*) = rk(Py») = rk(0*) = r. Thus, V<b E A4, rk(4>) < 2r; and hence, 


T'(Af) = sup 


|$| 


$eAt\{o} 


< s/2r. Further, T llim = 1. 


Next, we use the following proposition by Negahban and Wainwright (2012), to bound 
Kfc(n, 01) in Theorem [T] 


Lemma 3. If Q C [m] x [n] is sampled using uniform sampling and |fl| > nlogn, then for a 
Rademacher sequence {e, :] . V(i, j) E Q}, 


E 


1 

W\ 


Y s/mnejjeje* || 2 < 10-i 
ijEfl 


I n log n 




This follows from Lemma 6 of Nega hba n and Wai nwright 2012|), using Q > n log n. □ 

Thus, for large enough cq > 640, using Kn(n, |fl|) = 10 " j ( /jj" in Theorem 0 for some 


Y > 0 we have: 


_ 2rja* 

II £ = e y/™™ 


jy- 640 \ , 

I\p - J = Kpe v™. 


( 12 ) 


Finally, to prove the corollary, we derive a bound on \\Vq(X — gr(0*))|| 2 using the Ahlswede- 
Winter Matrix bound (Appendix IA.3I ). Let fix) = 'ifiix) = X — 1; and let = £rmi{X l:) — 
such that, ^\\p n (X - g(Q*))h = IIm E Y Hj) h- 




ij£fl 


From the equivalence of sub-Gaussian definitions in Lemma 5.5 of Vershvnin (12010I) . there 
exists a constant c\ such that | X t3 — g(0T)||</, < c\b, Yij. Since, Y (lJ ) has a single element, 
y/rrm(Xij - g(@*j)), we have, \\Y^\\^ 2 < c\ sjrnrih. Further, 


E[yM T yY = M[mn(Xij - g{%)) 2 e je *] ( = } mnE feeQ) [E x [(X li - p(0p 2 ]e,e*] 

< mnb 2 E^ jen / e j e *j\ ~ mnb 2 Y nxn , (13) 

where (a) follows from Fubini’s Theorem, (b ) follows as (X t j — g(@*j )) is 6-sub-Gaussian, and 
(c) follows from the uniform sampling model. Similarly, E[y(0)y(0) T ] = rnnb 2 I mXm . Define 
<7?- := max{E[yfe') T y^')],E[yfe')yto) T ]} < nb 2 

In LemmalU using a 2 := £ijen a ij = n\^\b 2 , XI = c\ y/mrib < c\nb, and t = |f!|<5, we have: 

p(|| ^y(^)|| 2 > 5 ) < n 2 max | e 4nb^ ; e 2c l" 6 |. 

' ' ij&n 
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Figure 1: Appropriate error metric between observation matrix X, and the MLE estimate from © 
X, plotted against “normalized” sample size, when X is generated from (a) Gaussian, 
( b ) Bernoulli, and (c) binomial distributions 


If 101 > cn log n for large enough c > 0, then for any constant C, using 5 = Cb^J — j^p , 

F (if l|F “ (X -» (0 ' ))l1 ^ 2 Cb ^W) S » 2 ^ l0S ” <>4) 

Re-parameterizing the constants, we have for /3 > 0, 3Cp > 0 such that w.p. > 1 — fP 1 log ”, 
PPP-\\Vvl(X - g {®*))|| 2 < Cpb^j Thus, using TVin > 1, pc = K' p e~V^ (from dT2])), 
and ^ := Cp\Jmnb ^ j n j °^ n in Theorem [T|leads to the corollary. 


4.3 Proof of Theorem H] 


This proof uses symmetrization arguments and contractions (l.edoux and Talagra n d (1991) Ch.4& 
6). We observe that, V (i, j) e 12, 3vij € [0,1], s.t. 


B G {Qij, 0T) = G(0y) - G(Q*j) - 5(0T)(0y - 0T) 

= V 2 G((1 - % )0T + VijGij)Afj > (15) 


where (a) holds as |(1 - %)0T + %0ij| < ||0*|| m ax + ||@||max < ^=, and V 2 G(tt) > 
e _T? l“l (Assumption |2]). 

Lemma 4. Under Theorem\2\ consider the subset 


£ = {A g V : a sp ( A) < 


1 

c 0 ^(M) 



Given any constant (3 > 0, there exists a constant kp > 0, such that w.p. > 1 — 4e ( 1 +4)' I 'mii 1 lo § 2 n t 
V A € £: 


mn V A 2 1 < 167 ^(^) kpTZjA) 

1^1 13 ~c Q V(M))I nlogn c§tf(A*)‘ 

The proof is provided in Appendix I A. 11 


□ 
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From the assumptions in Theorem [2] and Proposition [H 


€ 8. Further, as A £ V, 'JZ(A) < 


1Z( A^j-) + lZ(A—±) < 4 TZ(Ajj) < 4T'(7W)||A||^. Using FemmalU and (fl5l) . and choosing 


|fl| > C()4i-(M )n log n, for large enough cq, we have Kp := 1 — > 0. Finally, using //£ := 


-^s( K , 64 t / |nK(n,|n|) 


Co 


nlogn 


; if > 0, then w.h.p., 


TqT > VC\\&\ 

' ' ijen 


2 

F- 


(16) 


5. Experiments 

We provide simulated experiments to corroborate our theoretical guarantees, focusing specifically 
on Corollary [U where we consider the special case where the underlying parameter matrix is low- 
rank, but the underlying noise model for the matrix elements could be any of the general class 
of exponential family distributions. We look at three well known members of exponential family 
suitable for different data-types, namely Gaussian, Bernoulli, and binomial, which are popular 
choices for modeling continuous valued, binary, and count valued data, respectively. 


5.1 Experimental Setup 


We create low-rank ground truth parameter matrices, 0* € M"' xn of sizes n E 
{50, 100, 150, 200} (for simplicity we considered square matrices, m = n); we set the rank 
to r = 2 log n. The observation matrices, X, are then sampled from the different members of ex¬ 
ponential family distributions parameterized by 0*. For each n, we uniformly sample a subset U 
entries of the observation matrix X, and estimate 0 from (O. 

Evaluation: 

For each member of the exponential family of distributions considered, we can measure the per¬ 


formance of our M -estimator in parameter space as 


lie-©" 


lie- 


p - 

Hf 


II2 

IIf 


or in observation space using an 


appropriate error metric err(X, X), where X is the maximum likelihood estimate of the recovered 
distribution, X = argmax x P(X\Q) (we use RMSE, MAE in our plots). From our corollary, we 
require |U| = O(rn log n) samples for consistent recovery, so we plot the error metric against the 


the “normalized” sample size, 


\n\ 


For reasons of space, we only provide results for the error 


rn log n ’ 

metric in observations space plotted against the the “normalized” sample size. The remainder of 
the results are provided in Appendix B. It can be seen from the plots that the error decays with 
increasing sample size, corroborating our consistency results; indeed Q > 1.5rnlogn samples 
suffice for the errors to decay to a very small value. Further, the aligning of the curves (for different 
n) given the “normalized” sample size corroborates the convergence rates as well. 
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Appendix A. Proofs of Lemma 

A.l Proof of Lemma|4| 


Recall that V = {A : 72(A-p- x)< 372.(A^-)}. To prove Lemma 4, consider the nuclear norm ball 
Sn(t) = {A : 72(A) < 7}. 


1. Show that, P 


sup 

AefnS K (t) 


M S ijeCl ^ij 1 


> 


St 


|r2|K 2 (n,|n|) fcgt 


is small; 


co'I'(At) V nlogn ^ 2c^(M) 

where k (n, |fl|) is a quantity that depends only on the dimensions n and |f2|. This is done by: 


sup 

L Ae£nS K (0 


(a) Bounding the expectation, E 

(b) Showing an exponential decay of the tail. 


mn a 2 _ -i 

pf 2-^ijen ^ij 1 


2. Then use a peeling argument Raskut ti et al.l ( 20101) to derive at the result in Lemma 0] 

A.1.1 Bounding Expectation 

Note that V A € £, E[ ^ A;,-] = IIA | |% = 1. Thus, by using standard symmetrization 

argument (Lemma 6.3 of Ledoux and Talagrandl (1991), with a Rademacher sequence, {cy, V ij <£ 
fi}, we have: 


E 


sup 

Ae£n Sn(t) 


™ y a 2 --i 
|n| ^ 13 


2 mn 

< ——E 


ij SO 




sup 


Aefn Sn ( t)' ijen 




(17) 


Also, VA € <5, 4>ij(A) = 


A 2 . 

_ 

2 SUp || A|| m ax" 

Aef 


is a contraction, and VA € <5, IIAI 


ctsp(A) 


< 


_ ___ M 

c 0 ' H ( M ) y / rrm\l nlogn 

we have: 


. Thus, using Theorem 4.12 of Ledoux and Talagrand (1991) in Equation [171 


E 


sup 

Ae£n Suit) 


™ y A 2 - - 1 

|fi| A, 13 

1 1 ijesi 


< 


|0| 


coT’(Al) V nlogn 


-E 


sup 

Ae£ns n (t) 


<mn 




y. e ij e i e j > A 


(“) 87 




c 0 ^(A4) V nlogn 


-E 


'rrm. 


|f!| 


K ’(E 




6ij6iCj 




(18) 


where (a) follows from Cauchy—Schwartz and as 72(A) < 7. Note that 72 * [J2 ije n e ij e i e *j ) is 
independent of A and depends only on n and |fi|. Let n(n, |fi|) > E v j^|” 72* ( e ij e i e 


be a suitable upper bound. 


E 


sup 

Ae£n S K (t) 


mu v—' , 

1 1 ijen 




- 1 


<- 


87 / |f2|n 2 (n, |fi|) 


c 0 nlogn 


(19) 
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A.1.2 Tail Behavior 


mn A 2 _ 1 

|n| 


Let Gt(fi) = sup 

AefnSTj(t) 

differ from 0 in exactly one element. We then have: 


Let O' C \rn] x [nl be another set of indices that 


G t (S2) - Gt(Q') 


sup 

Ae£nS K (t) 


uni \—a . 

1 1 ij£fl 


- 1 


sup 

Ae£nS K (t) 


mn 

w 


E A h 

kieci' 



mn 

■ w 


sup 

Ae£r\S n (t) 



*22 A k> ~ 1 

fcZeH' 


2 mn o 2 

< SUP ll A llmax < 2 T 2 /~T~~7\ 1 - 

|U| AefnSTj(t) cg'J' 2 (A4)nlogn 


mn 

“M 


sup 

Ae£n S K {t) 



Hefi' 


By similar arguments on Gt( O') — Gt(fi), we conclude that |Gt(0) — Gt(0')| < c 2q,2(2)ni 0 gn ' 
Therefore, using Me Diarmid’s inequality, we have P(|C7/.(0) — E[Gt(0)]| > 5 ) < 

2 exp c °^ ^ log Fix 6 = 22(2) 4or a PP ro P r >ate constant k\. Recall that T'min = 

inf 2x^1 — Using |0| < n 2 , 


P(G t (Q ) > — 

[ 1 ' co*(M) 


|0|n 2 (n, |0|) 2k\t 


nlogn cl^(M) 


=2) < 2 exp (-2fc',f 2 TU 11 log 2 n) 


A. 1.3 Peeling Argument 

Consider the following sets, Sf = {A € £ : 2 £_1 \P min < 72(A) < 2^'f' min }, for all (integers) 
£ > 1. Since, VA € £, 72(A) > ’T m i n ||A ||f = $ m i n , for each A e f, A e for some £ > 1. 

| mn A 2 — 1 

|o| 2-^ijen L -*ij 


Further, if for some A e £, 


t 


lewjA) M + 2W1, then for some 
co'L(At) V nlogn cg'f(A'f) 


^ V A 2 - - 1 
ioi 


101 2 

1 1 ijeo 


> 


16(2^— 1 )T f m 
co’I'CM) 

8(2 £ ^ min ) 
c 0 ^(A7) \ 


|0|n 2 (n,|0|) 4/ci2^ _1 T' r 


n log n 


Cg'F(Af) 


|U|n 2 (n,|0|) 2fci(2^ n 


n log n 


c 2 *(M) 


Thus, 


( 20 ) 


PI sup 

\Ag£ 


mn 

W 


E A « - 1 


ij£fl 


> 


1672(A) /|0|n 2 (n,|0|) 4&i 72(A) \ 


cqT'(TW) V nlogn 


+ c 2 4/(A4) ; 


< 


E p ( G v(u) 


> 


8(2% r 


coT'(Ad) 


|0|n 2 (n,|0|) 2fci(2% 


*=1 
(a) 00 

< E 2 eX P (~ 4 lo § 2 k l^tnn lo g 2 Tl ) < 

t= 1 


nlogn ' Co^(TW) 
2 e -MjyPjog 2 n 

1 _ log 2 n 


^ <E 2ex P(-2^ 2l ®L l0 g 2 n) 


^=i 

< 4 e - 4fc f' I, min l0 g 2n 


( 21 ) 
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where (a) follows as x > log x for x > 1, and the last step holds for n > 1. The le mm a follows by 
re-parametrization of constants in terms of fi. 


A.2 Proof of Lemma [2] 

Let A = 0 — 0*. 

R(§) = R(e* + + A wi ) > R(e* +V)- R(A m ) = R(e*) + K(A m± ) - Rfe) 

( 22 ) 


The above inequalities hold due to triangle inequality, and decomposability of 7Z over &* £ A4 and 
A—± € M ± . 


mn 

W 


E = ITTrt E G{Q ij )-X ij Q ij -G{%) + X ij Q* ij + {Vn{X-9{&), A)] 


(jj)eo 






( a ) ^ mn — 

< XR{® ) - \K(@) + —K*(Vn(X - g(G*)))K( A) 

< X7Z(A^) - A^(A^x) + ^(A m + A^x) < ^U(A m ) - ^(A^ 


< 3 A * (Af) lie* - ©*db ^ ^r^ ll®* - ©lb 


(23) 


where (a) follows as 0 is the minimizer of © and using Cauchy Schwartz, ( b ) follows from 0 (l22l) 
and using ^7 Z*(Vn(X — g(Q*)) < and (c) follows from triangle inequality. □ 

A.3 Ahlswede-Winter Matrix Bound (Extension) 

The Orlicz norm of a random matrix Z £ M mx " w.r.t to a convex, differentiable and monotonically 
increasing function, <j>{x) : R + —> R as follows: 


Z\\(/> =inf{f > 0 : E b (|<Z, Z')\/t))] < 1, 
V Z' £ M mxn , and Z^ £ [0,1]} 


Lemma 5 (Ahlswede-Winter Matrix Bound). Let Z^\ Z^ 2 \ ..., Z (h > be random matrices of di¬ 
mensions m x n. Let < M, Vi. Further, o 2 = max {||E[Z« t Z«]|| 2 , ||E[Z«Z« t ]|| 2 }, 

and a 2 = a i> then: 


K 


P 


E zW 


2 > t I ^ 


1=1 


\ I 

ft 2 t | 

< mn max < 

e LP , e 2 m | 


The above lemma is an extension noted b y Vershvni n ( 20091) (Theorem 1 and a later remark) for 
the matrix bounds resulting from Ahlswede and Winter (2002). 
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Appendix B. Additional Experimental Results 


We provide the additional experimental results where we compare the error of the estimate in the 
parameter space. We plot the results first against the proportion of the total entries sampled, ^ 
(Figure on left), and then against the “normalized” sample size, r - (Figures on right). We 
observe trends similar to those observed in Section 5. Again, we find that the curves (for different n) 
given the “normalized” sample size, align and converge (left), corroborating the theoretical results. 
Note that, the curves do not align when plotted against, unnormalized sample size (right). Further, 
as with errors in observation space, with Q > 1.5rn log n samples, the errors parameter space also 
decay to a sufficiently small value. 



|f2|/mn |S7|/rnlogn 

Figure 2: Parameter Error when measured (a) against proportion of the sampled values, and ( b ) 
against the ‘normalized” sample size, when the distribution of the observations P(X |©*), 
is Gaussian 
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Figure 3: Parameter Error when measured (a) against proportion of the sampled values, and ( b ) 
against the ‘normalized” sample size, when the distribution of the observations P(X |©*), 
is Bernoulli 



Figure 4: Parameter Error when measured (a) against proportion of the sampled values, and (6) 
against the ‘normalized” sample size, when the distribution of the observations P(X|0*), 
is Binomial 
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